🌻 ! Autoclustering with embeddings#

22 Sep 2025

This is old material but can be re-used

Summary#

Motivation#

Often we have more labels that we really want. People like simple maps with 3-10 factors. 30 is about the maximum number you can possibly squeeze onto a map, and if you do you’ll not want more than 30 or 40 links.

BUT for overview summaries, we want to provide maps with a high link coverage, close to 100%, meaning that nearly all the original links are actually included in the current map. Otherwise, we are just showing a small selection of what people actually said.

One way to do this is with zooming. It’s a way to reduce the number of factors while retaining the link coverage (because we don’t throw away links, we only reroute them to higher level factors).

BUT zooming requires hierarchical factor labelling.

Disadvantages:

Solution#

The idea of autoclustering is to take all the factor labels and group them into clusters with similar meaning using embeddings.

Below are questions interesting to ask when you are looking at the list of clusters:

To do this, you may also look at the contents of each cluster (the raw labels assigned to them) in case you think any of the cluster labels are missing the point of the main contents or putting a wrong accent on them. But don't worry at this stage if you see a few raw labels which don't exactly fit the clusters. We can deal with those later. At this stage our focus is only on getting the cluster labels right.

Labelling#

As well as clustering the factors, the app then also tries to choose a good label for them. It chooses the label of one of the factors in each cluster as a label for the whole cluster, picking one which is both typical in meaning but also short. Overall:

Untitled

You can autocluster at any point in the transform chain, so for example it is also possible to zoom to level 1 first and then autocluster just the top-level labels.

Saving your label sets#

It is not possible to reassign factors between clusters, and you probably won’t want to. But for publication and presentation purposes, you may well want to provide nicer labels than those auto generated by the app.

If you want to make changes to the labels, you can save them in a label set.

The label sets are saved to the server but are separate from the file. So they are not, for example, restored if you restore an earlier version of the file.

Label sets should get applied correctly if you save a view of your file, using a particular label set, in 📚 The Library.

Each label set has a unique number.

You can create and edit label sets either from the factor modal which you see either when clicking on a factor or by clicking Manage button in the transforms widget:

Untitled

As soon as you press the Save icon in either dialog:

From the Manage dialog, you can even save your changes to a different label set:

Untitled

For each file you can have as many “relabel sets” as you want.

If you select an existing label set, then the set with the number you selected becomes active, and the factors will be grouped into clusters specified in your selected label set and given the label specified in that label set. The granularity slider is then ignored.

Why magnetic labels#

You have already coded your dataset, manually or using AI, and now you want to relabel.

Suppose you already know what labels you want to use, perhaps:

Magnetic labels are a really simple solution for these cases.

How to use them#

Untitled

You simply type the list of magnetic labels you want and decide on the power of the magnets (”magnetism”).

Magnetic labels attract existing labels of similar meaning, essentially relabelling these old labels with the new magnetic label. If an existing label is similar to two or more different labels, it is relabelled with the magnetic label it is most similar to.

If you use low magnetism, the magnets are weak and only attract existing labels which are very similar to them.

Increasing the magnetism means that more and more existing labels are attracted to the magnetic labels.

Existing labels which are not attracted to any label are unchanged. This means that you can easily see if your magnetic labels cover most of the original content.

Best practice is then, after applying magnetic labels, to then auto-cluster the links in order to pick out important themes which are not covered by the magnetic labels.

If you want you can even include hierarchical magnetic labels like Health behaviour; hand washing.

Storing your labels#

Your magnetic labels are included if you save a bookmark aka saved View.

You can also store your preferred label set in the Codebook in The Files tab.

Use cases#

Tips#

What if you have a semantically ambiguous label?#

An example: you have some research about animals and you want to look for mentions of the organisation Animal Aid. If you use Animal Aid as a label, it might also pick up any mention of helping animals which have nothing to do with the organisation itself.

One way to get round this is to use 🔗 The Manage Links tab to permanently recode any mention of Animal Aid in your factor labels into something unambiguous like, say, The Archibald Organisation. Choose a meaningless name which is not going to appear in or be related to to the rest of your material.

When doing this “hard recoding” remember to recode AnimalAid, Animals Aid etc as well.

Attracting unwanted material away from your map#

You can add an factor into magnetic clusters even if it doesn't appear in the final map.

For example you might have a lot of material about blood donors and you don’t want material about donating clothes. As well as donating blood you might add the labels donating goods and donating money. You can filter these out later, but they will help restrict donating blood to what you want.

Increasing coverage with hierarchical magnetic labels#

It might just be that your interview material is so heterogeneous that, however you choose your magnetic labels, if you only have say 10 or 20 of them then they are just not going to cover more than say 30% of your links in all of your stories and that's just the way it is because the material is very broad.

You might have hoped to arrive at a kind of global mind map - but the best you can do is just these most frequently mentioned common factors. You'd have to accept that it isn't in any sense a summary of all of the material because there's lots of other stuff that doesn't feature amongst the top 10 or 20 magnetic labels.

You might then also want to focus on more specific maps for more specific subject areas.

Or maybe you have a sense that in fact much of the material really is held in common but you're struggling to find the right magnetic labels? One way to increase coverage is to use hierarchical magnetic labels, of which you might have even 30 or 60 or even 100, and then zoom out to level one. So you might have, say, magnets like:

Desire for innovation; digital

Desire for innovation; management approaches

....

And then you'd apply a zoom level of 1 in order to bundle these things together.